Customer Analytics Case Study

Dataset Description

Cancel Table:

  1. userid - Fake user IDs, formatted as email addresses for this dataset.
  2. orderid - Unique order # for a given purchase. One customer may have many orders.
  3. customer_since - Date on which the customer first became a paying customer.
  4. country - Home country of the customer.
  5. amount - Order amount, which is typically monthly recurring revenue (MRR).
  6. Client Status - Current state of the customer account. There are three potential values: Active, Inactive, and Closed. The latter two are treated the same for this analysis.
  7. Order Status - Current state of order.
  8. billingcycle - Indicates whether this is monthly, quarterly, semi-annual, or annual payments. As noted above on the Amount field, the vast majority of customers pay monthly.
  9. product - We have a variety of product types, and I simply made this fake data by converting each to a numeric value.
  10. cancel_date - Self-explanatory.
  11. Cancel Count (45 Days) - Count of an individual customer's orders that were cancelled in the last 45 days.
  12. Active Order Count - Quantity of open orders for a single customer.
  13. Recurring Amt - A summation of all active order $ amounts for a given customer. It is worth noting that this particular dataset may not have all orders for a customer, so if you sum up the values in the Amount field they are likely not going to total to the Recurring Amt field.
  14. Customer Type - Indicates whether this customer ever actually paid us any money. Many sign up for trials and then do not convert (which is one of the things I am hoping we can dive into deeper to understand why) and these are marked as 'Trial', whereas all those marked as 'Paying' either converted from trials or directly signed up and skipped the trial process.

Package Table:

  1. Customer - This correlates to the customer ID in the Cancel table, although you would need to extract it from the email address.
  2. Name - See note below. In short, this field has confused me for a while so you can likely disregard much of it.
  3. Category - We sell three types of proxies: Dedicated (meaning a single customer has use), Semi-dedicated (meaning that multiple customers use the same one, but they get to pay a cheaper price as a result), and Rotating (meaning that they pay a set $ amount, and on a periodic basis that is usually around 10 minutes, they rotate amongst a pool of proxies).
  4. Country - Our proxies IP addresses are in data centers throughout the world, so this is the location of that IP address.
  5. Type - I'm actually not sure why this field sometimes has a combination of country and category, and usually just says standard.
  6. Category - Sorry, I should have labeled these better. The first Category field above is from a proxy packages table, and this one in the far right column is from the history table. So for cancellations, this one is going to be the better to use.
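Since the two tables key on the customer differently, joining them requires pulling the numeric ID out of the Cancel table's email-style userid first (as noted in the Customer field description above). A minimal sketch on hypothetical toy rows, using the column names from the tables above:

```python
import pandas as pd

# Toy rows mirroring the two tables (values are hypothetical)
cancel = pd.DataFrame({"userid": ["dummy35200@sprious.com"], "amount": [71.5]})
package = pd.DataFrame({"CustomerID": [35200], "category": ["rotate"]})

# Extract the numeric customer ID from the email address, then join
cancel["CustomerID"] = cancel["userid"].str.extract(r"(\d+)", expand=False).astype(int)
merged = cancel.merge(package, on="CustomerID", how="inner")
print(merged[["CustomerID", "category"]].to_string(index=False))
```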

Data Profiling

In [2]:
# loading libraries
import pandas as pd
import numpy as np
import re

import pandasql
from pandasql import sqldf

import matplotlib.pyplot as plt
import seaborn as sns

import plotly
import plotly.express as px
import plotly.graph_objects as go

from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from feature_engine.categorical_encoders import OrdinalCategoricalEncoder
from feature_engine.variable_transformers import LogTransformer
from catboost import CatBoostClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold

import pycountry

import warnings
warnings.filterwarnings('ignore')
In [46]:
# loading cancel table
cancel_table = pd.read_csv("D:\\EverythingDS\\DataSets\\DataAnalystCaseStudyData-canceltable.csv")
cancel_table.head()
Out[46]:
userid orderid country customer_since amount Client Status Order Status billingcycle product cancel_date Cancel Count (45 Days) Active Order Count Recurring Amt Customer Type
0 dummy215842@sprious.com 112183 US 8/14/2018 71.5 Inactive Cancelled Monthly Product 1 7/22/2019 0:16 NaN 0 0.0 Paying
1 dummy237318@sprious.com 129059 ES 11/24/2018 26.0 Inactive Cancelled Monthly Product 2 7/22/2019 1:35 NaN 0 0.0 Paying
2 dummy264667@sprious.com 152715 US 4/24/2019 25.2 Inactive Cancelled Monthly Product 3 7/22/2019 1:45 NaN 0 0.0 Paying
3 dummy255378@sprious.com 165297 BR 2/28/2019 3.0 Inactive Cancelled Monthly Product 4 7/22/2019 6:05 NaN 0 0.0 Paying
4 dummy271219@sprious.com 157789 UA 5/28/2019 310.0 Inactive Cancelled Monthly Product 5 7/22/2019 7:48 NaN 0 0.0 Paying
In [9]:
# loading package table
package_table = pd.read_csv("D:\\EverythingDS\\DataSets\\DataAnalystCaseStudyData-packagetable.csv")
package_table.head()
Out[9]:
CustomerID name category type country category.1
0 35200 10 Ports Monthly rotate de-rotate de rotate
1 47037 10 ports Monthly static us-static us static
2 51755 5 Port Proxy Plan semi-3 us-semi-3 us semi-3
3 55801 25 Ports Monthly static us-static us static
4 65531 10 ports Monthly static us-static us static
In [10]:
# cancel table info
cancel_table.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5733 entries, 0 to 5732
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   userid                  5733 non-null   object 
 1   orderid                 5733 non-null   int64  
 2   country                 5733 non-null   object 
 3   customer_since          5733 non-null   object 
 4   amount                  5733 non-null   float64
 5   Client Status           5733 non-null   object 
 6   Order Status            5733 non-null   object 
 7   billingcycle            5733 non-null   object 
 8   product                 5733 non-null   object 
 9   cancel_date             5733 non-null   object 
 10  Cancel Count (45 Days)  998 non-null    float64
 11  Active Order Count      5733 non-null   int64  
 12  Recurring Amt           5733 non-null   float64
 13  Customer Type           5733 non-null   object 
dtypes: float64(3), int64(2), object(9)
memory usage: 627.2+ KB
In [11]:
# package table info
package_table.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7569 entries, 0 to 7568
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   CustomerID  7569 non-null   int64 
 1   name        7569 non-null   object
 2   category    7569 non-null   object
 3   type        7569 non-null   object
 4   country     7569 non-null   object
 5   category.1  7569 non-null   object
dtypes: int64(1), object(5)
memory usage: 354.9+ KB
In [12]:
# checking null values in cancel table
cancel_table.isnull().sum()
Out[12]:
userid                       0
orderid                      0
country                      0
customer_since               0
amount                       0
Client Status                0
Order Status                 0
billingcycle                 0
product                      0
cancel_date                  0
Cancel Count (45 Days)    4735
Active Order Count           0
Recurring Amt                0
Customer Type                0
dtype: int64
In [13]:
# checking null values in package table
package_table.isnull().sum()
Out[13]:
CustomerID    0
name          0
category      0
type          0
country       0
category.1    0
dtype: int64

Data Cleaning

Cancel Table

In [47]:
# checking for duplicate rows
cancel_duplicates = cancel_table[cancel_table.duplicated()]
cancel_duplicates
Out[47]:
userid orderid country customer_since amount Client Status Order Status billingcycle product cancel_date Cancel Count (45 Days) Active Order Count Recurring Amt Customer Type
4952 dummy236464@sprious.com 222087 AT 11/20/2018 200.0 Active Cancelled Monthly Product 36 6/8/2020 5:27 5.0 3 3150.0 Paying
4953 dummy236464@sprious.com 222087 AT 11/20/2018 200.0 Active Cancelled Monthly Product 36 6/8/2020 5:27 5.0 3 3150.0 Paying
In [48]:
# checking redundant data in orderid
order_duplicates = cancel_table.groupby("orderid").size().reset_index()
order_duplicates[order_duplicates[0] > 1]
Out[48]:
orderid 0
5227 222087 4
In [49]:
# checking the redundant orderid in the cancel_table
cancel_table[cancel_table["orderid"] == 222087]
Out[49]:
userid orderid country customer_since amount Client Status Order Status billingcycle product cancel_date Cancel Count (45 Days) Active Order Count Recurring Amt Customer Type
4950 dummy236464@sprious.com 222087 AT 11/20/2018 200.0 Active Cancelled Monthly Product 36 6/8/2020 5:26 5.0 3 3150.0 Paying
4951 dummy236464@sprious.com 222087 AT 11/20/2018 200.0 Active Cancelled Monthly Product 36 6/8/2020 5:27 5.0 3 3150.0 Paying
4952 dummy236464@sprious.com 222087 AT 11/20/2018 200.0 Active Cancelled Monthly Product 36 6/8/2020 5:27 5.0 3 3150.0 Paying
4953 dummy236464@sprious.com 222087 AT 11/20/2018 200.0 Active Cancelled Monthly Product 36 6/8/2020 5:27 5.0 3 3150.0 Paying
In [50]:
# getting rid of the redundant rows
cancel_table = cancel_table.drop_duplicates(subset = "orderid", keep = "first")
In [51]:
# extracting userid from email
cancel_table["userid"] = cancel_table["userid"].str.extract('(\d+)')
cancel_table.head()
Out[51]:
userid orderid country customer_since amount Client Status Order Status billingcycle product cancel_date Cancel Count (45 Days) Active Order Count Recurring Amt Customer Type
0 215842 112183 US 8/14/2018 71.5 Inactive Cancelled Monthly Product 1 7/22/2019 0:16 NaN 0 0.0 Paying
1 237318 129059 ES 11/24/2018 26.0 Inactive Cancelled Monthly Product 2 7/22/2019 1:35 NaN 0 0.0 Paying
2 264667 152715 US 4/24/2019 25.2 Inactive Cancelled Monthly Product 3 7/22/2019 1:45 NaN 0 0.0 Paying
3 255378 165297 BR 2/28/2019 3.0 Inactive Cancelled Monthly Product 4 7/22/2019 6:05 NaN 0 0.0 Paying
4 271219 157789 UA 5/28/2019 310.0 Inactive Cancelled Monthly Product 5 7/22/2019 7:48 NaN 0 0.0 Paying
In [58]:
# converting from datetime to date (truncating the time component)
cancel_table["cancel_date"] = pd.to_datetime(cancel_table["cancel_date"]).dt.normalize()
cancel_table.head()
cancel_table.head()
Out[58]:
userid orderid country customer_since amount Client Status Order Status billingcycle product cancel_date Cancel Count (45 Days) Active Order Count Recurring Amt Customer Type period
0 215842 112183 US 2018-08-14 71.5 Inactive Cancelled Monthly Product 1 2019-07-22 NaN 0 0.0 Paying 342.0
1 237318 129059 ES 2018-11-24 26.0 Inactive Cancelled Monthly Product 2 2019-07-22 NaN 0 0.0 Paying 240.0
2 264667 152715 US 2019-04-24 25.2 Inactive Cancelled Monthly Product 3 2019-07-22 NaN 0 0.0 Paying 89.0
3 255378 165297 BR 2019-02-28 3.0 Inactive Cancelled Monthly Product 4 2019-07-22 NaN 0 0.0 Paying 144.0
4 271219 157789 UA 2019-05-28 310.0 Inactive Cancelled Monthly Product 5 2019-07-22 NaN 0 0.0 Paying 55.0
In [59]:
# obtaining the period of the customer's stay with the company
cancel_table["customer_since"] = pd.to_datetime(cancel_table["customer_since"])
cancel_table["period"] = cancel_table["cancel_date"] - cancel_table["customer_since"]
cancel_table["period"] = cancel_table["period"] / np.timedelta64(1, 'D')
cancel_table["period"].head()
Out[59]:
0    342.0
1    240.0
2     89.0
3    144.0
4     55.0
Name: period, dtype: float64
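The same day count can be read directly off the timedelta's `.dt.days` accessor, which avoids the explicit `np.timedelta64` division. A small sketch using the first row's dates:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_since": pd.to_datetime(["2018-08-14"]),
    "cancel_date": pd.to_datetime(["2019-07-22"]),
})
df["period"] = (df["cancel_date"] - df["customer_since"]).dt.days
print(df["period"].iloc[0])  # 342, matching the first row above
```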
In [60]:
# checking for different values within order status
cancel_table["Order Status"].value_counts()
Out[60]:
Cancelled    5532
Active        196
Name: Order Status, dtype: int64
In [61]:
# dropping Pending and Fraud orders, since there are too few of them to be significant
cancel_table = cancel_table[(cancel_table["Order Status"] != "Pending") & (cancel_table["Order Status"] != "Fraud")]
cancel_table["Order Status"].value_counts()
Out[61]:
Cancelled    5532
Active        196
Name: Order Status, dtype: int64

Assumption 1:

Assuming the Terminated and Cancelled statuses to be the same, since there is not enough context on the dataset to consider them separately.

In [62]:
# generalizing cancelled and terminated status of the order (Assumption - 1)
def terminated_to_cancelled(value):
    if value == "Terminated":
        new_value = "Cancelled"
    else:
        new_value = value
    
    return new_value

cancel_table["Order Status"] = cancel_table["Order Status"].apply(terminated_to_cancelled)
cancel_table["Order Status"].value_counts()
Out[62]:
Cancelled    5532
Active        196
Name: Order Status, dtype: int64
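The per-value function above can also be expressed as a single pandas `replace` call; a sketch on a toy Series:

```python
import pandas as pd

s = pd.Series(["Terminated", "Cancelled", "Active"])
print(s.replace({"Terminated": "Cancelled"}).tolist())
# ['Cancelled', 'Cancelled', 'Active']
```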
In [63]:
# checking for different values within client status
cancel_table["Client Status"].value_counts()
Out[63]:
Inactive    4619
Active      1062
Closed        47
Name: Client Status, dtype: int64

Assumption 2:

Assuming the Closed and Inactive statuses to be the same, since there is not enough context on the dataset to consider them separately.

In [64]:
# generalizing closed and inactive status of the client (Assumption - 2)
def closed_to_inactive(value):
    if value == "Closed":
        new_value = "Inactive"
    else:
        new_value = value
    
    return new_value

cancel_table["Client Status"] = cancel_table["Client Status"].apply(closed_to_inactive)
cancel_table["Client Status"].value_counts()
Out[64]:
Inactive    4666
Active      1062
Name: Client Status, dtype: int64
In [65]:
cancel_table["country_code"] = cancel_table["country"]
In [66]:
# obtaining country name from country code
def code_to_country(value):
    country = pycountry.countries.get(alpha_2 = value)
    # fall back to the raw code if pycountry does not recognize it
    if country is None:
        return value
    return country.name

cancel_table["country"] = cancel_table["country"].apply(code_to_country)

Package Table

In [34]:
# checking for duplicate rows
package_duplicates = package_table[package_table.duplicated()]
package_duplicates
Out[34]:
CustomerID name category type country category.1
715 193421 No Plan Selected rotate standard us static
738 200421 No Plan Selected rotate standard br static
742 214729 No Plan Selected rotate standard de static
753 201261 No Plan Selected rotate standard us static
759 134502 No Plan Selected rotate standard us rotate
... ... ... ... ... ... ...
7491 331797 No Plan Selected rotate standard ca static
7492 331797 No Plan Selected rotate standard gb static
7514 48346 No Plan Selected rotate standard us static
7553 273410 No Plan Selected rotate standard us static
7554 273410 No Plan Selected rotate standard us rotate

1442 rows × 6 columns

In [35]:
# dropping duplicate rows
package_table = package_table.drop_duplicates()
In [36]:
# renaming the redundant category columns
package_table.rename(columns = {"category": "proxy_category", "category.1": "history_category"}, inplace = True)
package_table.head()
Out[36]:
CustomerID name proxy_category type country history_category
0 35200 10 Ports Monthly rotate de-rotate de rotate
1 47037 10 ports Monthly static us-static us static
2 51755 5 Port Proxy Plan semi-3 us-semi-3 us semi-3
3 55801 25 Ports Monthly static us-static us static
4 65531 10 ports Monthly static us-static us static
In [37]:
# getting rid of wrong naming of the name columns
def renaming_names(value):
    if value == "10 ports Monthly":
        new_value = "10 Ports Monthly"
    elif value == "5 Port Proxy Plan":
        new_value = "5 Ports Monthly"
    elif value == "No Plan Selected":
        new_value = "Default"
    else:
        new_value = value
    
    new_value = re.sub(r"\s+", "_", new_value)
    
    return new_value

package_table["name"] = package_table["name"].apply(renaming_names)
package_table["name"].value_counts()
Out[37]:
Default               6105
10_Ports_Monthly         6
5_Ports_Monthly          6
50_Ports_Monthly         4
25_Ports_Monthly         4
1000_Ports_Monthly       1
100_Ports_Monthly        1
Name: name, dtype: int64
In [38]:
# obtaining country name from country code
def code_to_country(value):
    country = pycountry.countries.get(alpha_2 = value)
    # fall back to the raw code if pycountry does not recognize it
    if country is None:
        return value
    return country.name

package_table["country_name"] = package_table["country"].apply(code_to_country)

Exploratory Data Analysis

Cancellation Table

1. Is there a yearly increase in the number of new customers and orders?

In [39]:
# overlapping graph for orders and customers over years
grouped = cancel_table.groupby("userid")["customer_since"].min().reset_index()
grouped["start_year"] = grouped["customer_since"].dt.year
customer_grouped = grouped.groupby("start_year")["userid"].count().reset_index()

grouped = cancel_table.copy()
grouped["start_year"] = grouped["customer_since"].dt.year
year_grouped = grouped.groupby("start_year")["orderid"].count().reset_index()

trace1 = go.Scatter(
    x = year_grouped["start_year"],
    y = year_grouped["orderid"],
    name = "orders"
)

trace2 = go.Bar(
    x = customer_grouped["start_year"],
    y = customer_grouped["userid"],
    name = "customers"
    
)

data = [trace1, trace2]
fig = go.Figure(data = data)
fig.show()

Interpretation

  1. It can be seen that the number of orders is increasing with the number of customers over the years.
  2. This indicates positive growth in the company.
  3. The number of orders and customers for the year 2020 in the visualization should not be considered for now since we do not have data pertaining to all months in 2020 yet.

Where are our customers coming from?

In [40]:
# geo map for customers from different countries
import plotly.express as px

df = cancel_table.groupby("country")["userid"].nunique().reset_index()
fig = px.choropleth(df, locations="country",
                    color="userid", 
                    locationmode = "country names",
                    color_continuous_scale=px.colors.sequential.Plasma
                   )
fig.show()
In [41]:
# setting up the data to plot a pie chart
df.sort_values("userid", inplace = True, ascending = False)
pull = [0.1]
for i in range(109):
    pull.append(0)
    
fig = go.Figure(data=[go.Pie(labels=df["country"], values=df["userid"],
                             pull = pull
                            )])
fig.show()

Interpretation

  1. 34.8% of the entire customer base is from the United States.
  2. This makes sense, since the largest share of orders also comes from customers in the United States.
  3. This can be seen in the following visualization.

Where are our orders coming from?

In [42]:
# checking where our main source of orders is from
df = cancel_table.groupby("country")["orderid"].count().reset_index()
fig = px.choropleth(df, locations="country",
                    color="orderid", # lifeExp is a column of gapminder
                    locationmode = "country names",
                    color_continuous_scale=px.colors.sequential.Plasma
                   )
fig.show()
In [43]:
# setting up the data to plot a pie chart
df.sort_values("orderid", inplace = True, ascending = False)
pull = [0.1]
for i in range(109):
    pull.append(0)
    
fig = go.Figure(data=[go.Pie(labels=df["country"], values=df["orderid"],
                             pull = pull
                            )])
fig.show()

Interpretation

  1. As expected, the United States contributes 33.3% of all orders received.
  2. This visualization only shows the orders received.
  3. We still need to check whether the clients placing these orders continue subscribing to our services.
  4. This can be seen in the visualization below.
In [67]:
# Comparing us and other countries
df = cancel_table.copy()

def us_or_not(value):
    if value == "US":
        new_value = 1
    else:
        new_value = 0
    
    return new_value

df["US"] = df["country_code"].apply(us_or_not) 

# churn rate for us and non-us customers
grouped = df.groupby(["US", "userid"])
sub_grouped = pd.DataFrame()
for name, group in grouped:
    group.sort_values("cancel_date", inplace = True, ascending = False)
    new = group[["US", "userid", "country", "Client Status"]].head(1)
    sub_grouped = sub_grouped.append(new, ignore_index = True)
    
new_grouped = sub_grouped.groupby("US")
churn = pd.DataFrame(columns = ["country", "rate"])
for name, group in new_grouped:
    if name == 1:
        country = "US"
    else:
        country = "Non-US"
        
    churn_rate = len(group[group["Client Status"] == "Inactive"]) / len(group)
    churn = churn.append({"country": country,
                          "rate": churn_rate}, ignore_index=True)

fig = px.bar(churn, x = "country", y = "rate", color = "country")
fig.show()

Interpretation

  1. The churn rate of customers from the United States is higher than the combined churn rate of customers from all remaining countries.
  2. This is concerning: we are losing customers from our largest market, so we need to build a strategy to retain them.
  3. Further analysis of the products these customers order could inform that strategy.
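The per-segment churn comparison above can also be computed without explicit loops, by taking each customer's most recent row and aggregating. A sketch on toy data with the same column names (values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "userid": ["1", "1", "2", "3"],
    "country_code": ["US", "US", "US", "DE"],
    "cancel_date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-01-15", "2020-01-20"]),
    "Client Status": ["Active", "Inactive", "Active", "Inactive"],
})

# Most recent row per customer, then churn rate per US / non-US segment
latest = df.sort_values("cancel_date").groupby("userid").tail(1)
segment = latest["country_code"].eq("US").map({True: "US", False: "Non-US"})
churn = latest.groupby(segment)["Client Status"].apply(lambda s: (s == "Inactive").mean())
print(churn)
```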

What are our top 10 products?

In [68]:
# identifying top 10 products ordered
df = cancel_table.copy()
grouped = df.groupby("product")["orderid"].count().reset_index()
grouped.sort_values("orderid", ascending = False, inplace = True)
grouped = grouped.iloc[0:10, :]

fig = px.bar(grouped, x = "product", y = "orderid")
fig.show()

Are we doing well in retaining customer interest in the top 10 products?

In [71]:
# dataframe defining overall statistics of each product type
df = cancel_table.copy()
grouped = df.groupby("product")
new_df = pd.DataFrame(columns = ["Product", "Total_Count", "Active_Count", "Cancelled_Count", "Churn_Rate", "Retention_Rate"])
for name, group in grouped:
    total_count = len(group)
    active_count = len(group[group["Order Status"] == "Active"])
    cancelled_count = len(group[group["Order Status"] == "Cancelled"])
    churn_rate = cancelled_count/total_count
    retention_rate = active_count/total_count
    new_df = new_df.append({"Product": name,
                            "Total_Count": total_count,
                            "Active_Count": active_count,
                            "Cancelled_Count": cancelled_count,
                            "Churn_Rate": churn_rate,
                            "Retention_Rate": retention_rate}, ignore_index = True)

new_df.sort_values("Total_Count", ascending = False, inplace = True)
new_df = new_df.iloc[:10, :]

trace1 = go.Bar(
    x = new_df["Product"],
    y = new_df["Total_Count"],
    name = "Total Orders"
)
trace2 = go.Bar(
    x = new_df["Product"],
    y = new_df["Cancelled_Count"],
    name = "Cancelled"
)

trace3 = go.Bar(
    x = new_df["Product"],
    y = new_df["Active_Count"],
    name = "Active"
)

data = [trace1, trace2, trace3]
fig = go.Figure(data = data)
fig.show()

Interpretation

  1. The ratio of active to cancelled orders is strikingly low.
  2. This shows that, even though we are acquiring new customers and orders, we are struggling with a very low retention rate.
  3. This needs immediate attention: even a small increase in retention rate can lead to a large increase in revenue.
  4. This could also be partly explained by the large number of 'Trial' (rather than 'Paying') subscribers.
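The per-product loop above can also be written with `pd.crosstab`, which tabulates statuses per product in one call. A sketch on toy rows (values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["Product 1", "Product 1", "Product 1", "Product 2"],
    "Order Status": ["Cancelled", "Cancelled", "Active", "Cancelled"],
})

# One row per product, one column per status, plus derived rates
stats = pd.crosstab(df["product"], df["Order Status"])
stats["Total"] = stats.sum(axis=1)
stats["Churn_Rate"] = stats["Cancelled"] / stats["Total"]
print(stats)
```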

Package Table

In [73]:
# pie chart for category
data = package_table.copy()
df = data.groupby("history_category")["CustomerID"].count().reset_index()
df.sort_values("CustomerID", ascending = False, inplace = True)
pull = [0.1, 0, 0, 0, 0, 0, 0]
fig = go.Figure(data=[go.Pie(labels=df["history_category"], values=df["CustomerID"],
                             pull = pull
                            )])
fig.show()

Interpretation

  1. Static dominates the proxy categories chosen by customers, with a whopping 62.7% share.
  2. Semi-3 and rotate are chosen moderately, while the other categories are rarely chosen.

Which data centers handle the most traffic?

In [75]:
# geo map for data centers
import plotly.express as px

df = package_table.groupby("country_name")["CustomerID"].count().reset_index()
fig = px.choropleth(df, locations="country_name",
                    color="CustomerID",
                    locationmode = "country names",
                    color_continuous_scale=px.colors.sequential.Plasma
                   )
fig.show()
In [76]:
data = package_table.copy()
df = data.groupby("country_name")["CustomerID"].count().reset_index()
df.sort_values("CustomerID", ascending = False, inplace = True)
pull = [0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
fig = go.Figure(data=[go.Pie(labels=df["country_name"], values=df["CustomerID"],
                             pull = pull
                            )])
fig.show()

Interpretation

  1. Most of our data centers are located in the United States.
  2. This is most likely because of the high number of customers and orders in the region.
In [77]:
data = package_table.copy()
df = data.groupby("name")["CustomerID"].count().reset_index()
fig = go.Figure(data=[go.Pie(labels=df["name"], values=df["CustomerID"],
                            )])
fig.show()

Interpretation

  1. The Default package is the one most often chosen for customer orders.
  2. This might be due to the package's high quality, low cost, or good service rate, but cannot be pinned down for now given the lack of context on the data.

Customer Segmentation

Grouping by customer to find key attributes such as their tenure with the company, the revenue they generated, and their total number of orders.

In [79]:
# defining function to get overall revenue for an order
def revenue(df):
    if df["billingcycle"] == "Monthly":
        value = df["amount"] * 12
    elif df["billingcycle"] == "Quarterly":
        value = df["amount"] * 4
    elif df["billingcycle"] == "Semi-Annually":
        value = df["amount"] * 2
    else:
        value = df["amount"]
    
    return value

# obtaining the dataframe with the tenure, number of orders, and revenue of each customer
df = cancel_table.copy()
df["revenue"] = df.apply(revenue, axis=1)
grouped = df.groupby("userid")
new_df = pd.DataFrame(columns = ["userid", "tenure", "orders", "revenue"])
for name, group in grouped:
    total_revenue = group["revenue"].sum()  # renamed to avoid shadowing the revenue() function above
    tenure = group["period"].sum()
    orders = group["orderid"].count()
    new_df = new_df.append({"userid": name,
                            "tenure": tenure,
                            "orders": orders,
                            "revenue": total_revenue}, ignore_index = True)

new_df.head()
Out[79]:
userid tenure orders revenue
0 101567 1167.0 1 72.00
1 105928 1215.0 1 270.00
2 106684 975.0 1 288.00
3 107013 2307.0 2 280.08
4 107468 1152.0 1 72.00
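The annualization in revenue() can be expressed more compactly as a billing-cycle multiplier lookup; a sketch with the cycle labels following the function above, where unknown cycles fall back to 1x like the else branch:

```python
import pandas as pd

# Billing-cycle multipliers to annualize an order amount
multiplier = {"Monthly": 12, "Quarterly": 4, "Semi-Annually": 2}

df = pd.DataFrame({"billingcycle": ["Monthly", "Quarterly", "Annually"],
                   "amount": [71.5, 26.0, 300.0]})
df["revenue"] = df["amount"] * df["billingcycle"].map(multiplier).fillna(1)
print(df["revenue"].tolist())  # [858.0, 104.0, 300.0]
```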

The objective here is to use the tenure, number of orders, and revenue of each customer to build a metric that assesses each customer's innate value to the organization.

Tenure

In [80]:
# histogram for tenure distribution
plot_data = [
    go.Histogram(
        x=new_df['tenure']
    )
]

plot_layout = go.Layout(
        title='Tenure'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
fig.show()
In [82]:
# removing outliers for better visual representation
new_df1 = new_df[new_df["tenure"] <= 4000]
In [83]:
# histogram for tenure distribution
plot_data = [
    go.Histogram(
        x=new_df1['tenure']
    )
]

plot_layout = go.Layout(
        title='Tenure'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
fig.show()
In [84]:
# creating elbow plot to decide on right number of clusters
sse={}
tenure = new_df[['tenure']]
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(tenure)
    tenure["clusters"] = kmeans.labels_
    sse[k] = kmeans.inertia_ 
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.show()

Interpretation

  1. The right number of clusters is somewhere in the range of 4 to 6
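Besides the elbow plot, the silhouette score gives a complementary check on the choice of k (higher means tighter, better-separated clusters). A sketch on synthetic 1-D tenure-like data with three well-separated groups; the data here is made up purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three synthetic tenure-like groups around 100, 900, and 2800 days
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(100, 10, 50),
                    rng.normal(900, 50, 50),
                    rng.normal(2800, 100, 20)]).reshape(-1, 1)

for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
    print(k, round(silhouette_score(x, labels), 3))
```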
In [85]:
# using k-means to assign clusters
kmeans = KMeans(n_clusters=4)
kmeans.fit(new_df[['tenure']])
new_df['TenureCluster'] = kmeans.predict(new_df[['tenure']])
In [86]:
# describing cluster info
new_df.groupby("TenureCluster")["tenure"].describe()
Out[86]:
count mean std min 25% 50% 75% max
TenureCluster
0 3681.0 106.154849 134.793368 0.0 3.0 39.0 160.0 522.0
1 759.0 937.362319 328.804935 523.0 665.5 855.0 1131.5 1880.0
2 2.0 15080.000000 5902.927409 10906.0 12993.0 15080.0 17167.0 19254.0
3 91.0 2838.098901 1100.121943 1895.0 2117.5 2467.0 3103.5 7298.0
In [87]:
# defining a function to order the cluster numbers
def order_cluster(cluster_field_name, target_field_name,df,ascending):
    new_cluster_field_name = 'new_' + cluster_field_name
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name,ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df,df_new[[cluster_field_name,'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name],axis=1)
    df_final = df_final.rename(columns={"index":cluster_field_name})
    return df_final
In [88]:
# ordering the cluster numbers
new_df = order_cluster('TenureCluster', "tenure", new_df, False)
new_df.groupby("TenureCluster")["tenure"].describe()
Out[88]:
count mean std min 25% 50% 75% max
TenureCluster
0 2.0 15080.000000 5902.927409 10906.0 12993.0 15080.0 17167.0 19254.0
1 91.0 2838.098901 1100.121943 1895.0 2117.5 2467.0 3103.5 7298.0
2 759.0 937.362319 328.804935 523.0 665.5 855.0 1131.5 1880.0
3 3681.0 106.154849 134.793368 0.0 3.0 39.0 160.0 522.0
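A quick sanity check of order_cluster on toy data: with ascending=False, the cluster with the highest mean target should be relabeled 0. The function is restated here so the sketch is self-contained:

```python
import pandas as pd

def order_cluster(cluster_field_name, target_field_name, df, ascending):
    # Relabel clusters so that cluster numbers follow the sorted group means
    df_new = df.groupby(cluster_field_name)[target_field_name].mean().reset_index()
    df_new = df_new.sort_values(by=target_field_name, ascending=ascending).reset_index(drop=True)
    df_new['index'] = df_new.index
    df_final = pd.merge(df, df_new[[cluster_field_name, 'index']], on=cluster_field_name)
    df_final = df_final.drop([cluster_field_name], axis=1)
    df_final = df_final.rename(columns={'index': cluster_field_name})
    return df_final

# Cluster 0 has low tenure, cluster 1 high; after descending ordering,
# the high-tenure group should carry label 0
toy = pd.DataFrame({"TenureCluster": [0, 0, 1, 1], "tenure": [10, 20, 1000, 1200]})
out = order_cluster("TenureCluster", "tenure", toy, False)
print(out.sort_values("tenure"))
```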

Number of Orders

In [89]:
# converting data type of orders
new_df["orders"] = new_df["orders"].astype(int)

# histogram for orders distribution
plot_data = [
    go.Histogram(
        x=new_df['orders']
    )
]

plot_layout = go.Layout(
        title='Orders'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
fig.show()
In [90]:
# creating elbow plot to decide on right number of clusters
sse={}
orders = new_df[['orders']]
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(orders)
    orders["clusters"] = kmeans.labels_
    sse[k] = kmeans.inertia_ 
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.show()

Interpretation

  1. The right number of clusters is somewhere in the range of 4 to 6
In [91]:
# using k-means to assign clusters
kmeans = KMeans(n_clusters=4)
kmeans.fit(new_df[['orders']])
new_df['OrderCluster'] = kmeans.predict(new_df[['orders']])
In [92]:
# describing cluster info
new_df.groupby("OrderCluster")["orders"].describe()
Out[92]:
count mean std min 25% 50% 75% max
OrderCluster
0 714.0 2.197479 0.398376 2.0 2.0 2.0 2.00 3.0
1 3739.0 1.000000 0.000000 1.0 1.0 1.0 1.00 1.0
2 78.0 4.923077 1.235505 4.0 4.0 4.0 5.75 9.0
3 2.0 18.000000 4.242641 15.0 16.5 18.0 19.50 21.0
In [93]:
# ordering the cluster numbers
new_df = order_cluster('OrderCluster', 'orders', new_df, False)
new_df.groupby("OrderCluster")["orders"].describe()
Out[93]:
count mean std min 25% 50% 75% max
OrderCluster
0 2.0 18.000000 4.242641 15.0 16.5 18.0 19.50 21.0
1 78.0 4.923077 1.235505 4.0 4.0 4.0 5.75 9.0
2 714.0 2.197479 0.398376 2.0 2.0 2.0 2.00 3.0
3 3739.0 1.000000 0.000000 1.0 1.0 1.0 1.00 1.0

Revenue

In [94]:
# histogram for revenue distribution
plot_data = [
    go.Histogram(
        x=new_df['revenue']
    )
]

plot_layout = go.Layout(
        title='Revenue'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
fig.show()
In [95]:
# removing outliers for better visual representation
new_df1 = new_df[new_df["revenue"] <= float(10000)]

# histogram for revenue distribution
plot_data = [
    go.Histogram(
        x=new_df1['revenue']
    )
]

plot_layout = go.Layout(
        title='Revenue'
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
fig.show()
In [96]:
# creating elbow plot to decide on right number of clusters
sse={}
revenue = new_df[['revenue']]
for k in range(1, 10):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(revenue)
    revenue["clusters"] = kmeans.labels_
    sse[k] = kmeans.inertia_ 
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.show()

Interpretation

  1. The right number of clusters is somewhere in the range of 3 to 5
In [97]:
# using k-means to assign clusters
kmeans = KMeans(n_clusters=3)
kmeans.fit(new_df[['revenue']])
new_df['RevenueCluster'] = kmeans.predict(new_df[['revenue']])
In [98]:
# describing cluster info
new_df.groupby("RevenueCluster")["revenue"].describe()
Out[98]:
count mean std min 25% 50% 75% max
RevenueCluster
0 4531.0 398.643099 1311.74576 0.0 72.0 132.0 276.0 32400.0
1 1.0 660000.000000 NaN 660000.0 660000.0 660000.0 660000.0 660000.0
2 1.0 97449.600000 NaN 97449.6 97449.6 97449.6 97449.6 97449.6
In [99]:
# ordering cluster numbers
new_df = order_cluster('RevenueCluster', 'revenue', new_df, False)
new_df.groupby("RevenueCluster")["revenue"].describe()
Out[99]:
count mean std min 25% 50% 75% max
RevenueCluster
0 1.0 660000.000000 NaN 660000.0 660000.0 660000.0 660000.0 660000.0
1 1.0 97449.600000 NaN 97449.6 97449.6 97449.6 97449.6 97449.6
2 4531.0 398.643099 1311.74576 0.0 72.0 132.0 276.0 32400.0

Interpretation

  1. As you can see, there is only one customer in each of the top two clusters.
  2. These singleton clusters are driven by a couple of extreme revenue outliers rather than genuine segments.
  3. To disperse the customers more evenly and avoid skewing the overall score, I re-ran k-means with 5 clusters.
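As a complement to eyeballing the elbow plot when singleton clusters appear, candidate k values can also be compared with the silhouette coefficient. A hypothetical check on synthetic revenue-like data (not the notebook's dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic stand-in for a skewed revenue column: many small customers
# plus a handful of large accounts (assumption for illustration only)
rng = np.random.default_rng(42)
revenue = np.concatenate([rng.exponential(300, 500),     # bulk of customers
                          rng.normal(20000, 2000, 20)])  # a few large accounts
revenue = revenue.reshape(-1, 1)

# silhouette score: near 1 means tight, well-separated clusters
scores = {}
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(revenue)
    scores[k] = silhouette_score(revenue, labels)
best_k = max(scores, key=scores.get)
```

A higher silhouette score for a given k supports that choice, though for business segmentation interpretability often outweighs the raw score.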
In [100]:
# using k-means to assign clusters
kmeans = KMeans(n_clusters=5)
kmeans.fit(new_df[['revenue']])
new_df['RevenueCluster'] = kmeans.predict(new_df[['revenue']])
In [101]:
# describing cluster info
new_df.groupby("RevenueCluster")["revenue"].describe()
Out[101]:
count mean std min 25% 50% 75% max
RevenueCluster
0 4447.0 265.872121 396.689576 0.00 72.00 120.00 257.1 2952.0
1 1.0 660000.000000 NaN 660000.00 660000.00 660000.00 660000.0 660000.0
2 1.0 97449.600000 NaN 97449.60 97449.60 97449.60 97449.6 97449.6
3 8.0 23582.100000 4760.871312 15096.96 22050.00 24006.12 24471.9 32400.0
4 76.0 5727.128421 2310.405215 3104.40 3764.13 4800.00 7200.0 12000.0
In [102]:
# ordering cluster numbers
new_df = order_cluster('RevenueCluster', 'revenue', new_df, False)
new_df.groupby("RevenueCluster")["revenue"].describe()
Out[102]:
count mean std min 25% 50% 75% max
RevenueCluster
0 1.0 660000.000000 NaN 660000.00 660000.00 660000.00 660000.0 660000.0
1 1.0 97449.600000 NaN 97449.60 97449.60 97449.60 97449.6 97449.6
2 8.0 23582.100000 4760.871312 15096.96 22050.00 24006.12 24471.9 32400.0
3 76.0 5727.128421 2310.405215 3104.40 3764.13 4800.00 7200.0 12000.0
4 4447.0 265.872121 396.689576 0.00 72.00 120.00 257.1 2952.0
In [103]:
# displaying data with all clusters assigned
new_df.head()
Out[103]:
userid tenure orders revenue TenureCluster OrderCluster RevenueCluster
0 101567 1167.0 1 72.0 2 3 4
1 105928 1215.0 1 270.0 2 3 4
2 106684 975.0 1 288.0 2 3 4
3 107468 1152.0 1 72.0 2 3 4
4 108819 868.0 1 696.0 2 3 4

Overall Score

In [104]:
# calculating overall score
new_df['OverallScore'] = new_df['TenureCluster'] + new_df['OrderCluster'] + new_df['RevenueCluster']
new_df.groupby('OverallScore')[['tenure', 'orders', 'revenue']].mean().reset_index()
Out[104]:
OverallScore tenure orders revenue
0 3 15080.000000 15.000000 15578.280000
1 4 3398.500000 10.000000 20087.280000
2 5 2619.400000 4.400000 81565.104000
3 6 2832.029412 4.705882 3460.089412
4 7 2069.650602 2.783133 1875.990843
5 8 945.698113 2.424528 1237.712642
6 9 523.335583 1.519769 491.908814
7 10 100.295909 1.000000 200.083895

Interpretation

  1. For now I have weighted all three cluster groups equally to calculate a "customer value" score.
  2. Ideally, each component would carry its own weight (e.g. more emphasis on revenue, orders, or tenure depending on what we are looking to improve).
  3. The scoring here may also be affected by the data insufficiency that produces the irregular revenue clusters.
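The weighted variant mentioned above could look like the sketch below. The 0.2/0.3/0.5 split is purely illustrative (an assumption, not derived from the analysis), favoring revenue over orders and tenure:

```python
import pandas as pd

# hypothetical weights: revenue counts most, then orders, then tenure
weights = {"TenureCluster": 0.2, "OrderCluster": 0.3, "RevenueCluster": 0.5}

# two toy customers to show the arithmetic
demo = pd.DataFrame({"TenureCluster": [0, 2],
                     "OrderCluster": [1, 3],
                     "RevenueCluster": [0, 4]})
demo["WeightedScore"] = sum(demo[col] * w for col, w in weights.items())
```

The segment cut-offs would then need re-tuning, since a weighted score no longer ranges over the same integer values as the equal-weight sum.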
In [108]:
# defining a function to assign a segment based on the overall score
def segments(score):
    if 3 <= score <= 6:
        return "High Value Customers"
    elif 7 <= score <= 8:
        return "Mid Value Customers"
    return "Low Value Customers"

new_df["Segment"] = new_df["OverallScore"].apply(segments)
new_df.head()
Out[108]:
userid tenure orders revenue TenureCluster OrderCluster RevenueCluster OverallScore Segment
0 101567 1167.0 1 72.0 2 3 4 9 Low Value Customers
1 105928 1215.0 1 270.0 2 3 4 9 Low Value Customers
2 106684 975.0 1 288.0 2 3 4 9 Low Value Customers
3 107468 1152.0 1 72.0 2 3 4 9 Low Value Customers
4 108819 868.0 1 696.0 2 3 4 9 Low Value Customers
In [109]:
# viewing customer counts of different segments.
new_df["Segment"].value_counts()
Out[109]:
Low Value Customers     4190
Mid Value Customers      295
High Value Customers      44
Name: Segment, dtype: int64
In [110]:
# ignoring outliers (very high values) for clearer visualization of segments
new_df = new_df.query("revenue < 80000 and tenure < 10000")

# plotting customer segments for revenue vs tenure
plot_data = [
    go.Scatter(
        x=new_df.query("Segment == 'Low Value Customers'")['tenure'],
        y=new_df.query("Segment == 'Low Value Customers'")['revenue'],
        mode='markers',
        name='Low',
        marker= dict(size= 7,
            line= dict(width=1),
            color= 'blue',
            opacity= 0.8
           )
    ),
        go.Scatter(
        x=new_df.query("Segment == 'Mid Value Customers'")['tenure'],
        y=new_df.query("Segment == 'Mid Value Customers'")['revenue'],
        mode='markers',
        name='Mid',
        marker= dict(size= 9,
            line= dict(width=1),
            color= 'green',
            opacity= 0.5
           )
    ),
        go.Scatter(
        x=new_df.query("Segment == 'High Value Customers'")['tenure'],
        y=new_df.query("Segment == 'High Value Customers'")['revenue'],
        mode='markers',
        name='High',
        marker= dict(size= 11,
            line= dict(width=1),
            color= 'red',
            opacity= 0.9
           )
    ),
]

plot_layout = go.Layout(
        yaxis= {'title': "Revenue"},
        xaxis= {'title': "Tenure"},
        title='Customer Segments - Revenue VS Tenure',
        autosize=False, width=800, height=700
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
fig.show()
In [111]:
# plotting customer segments for revenue vs order
plot_data = [
    go.Scatter(
        x=new_df.query("Segment == 'Low Value Customers'")['orders'],
        y=new_df.query("Segment == 'Low Value Customers'")['revenue'],
        mode='markers',
        name='Low',
        marker= dict(size= 7,
            line= dict(width=1),
            color= 'blue',
            opacity= 0.8
           )
    ),
        go.Scatter(
        x=new_df.query("Segment == 'Mid Value Customers'")['orders'],
        y=new_df.query("Segment == 'Mid Value Customers'")['revenue'],
        mode='markers',
        name='Mid',
        marker= dict(size= 9,
            line= dict(width=1),
            color= 'green',
            opacity= 0.5
           )
    ),
        go.Scatter(
        x=new_df.query("Segment == 'High Value Customers'")['orders'],
        y=new_df.query("Segment == 'High Value Customers'")['revenue'],
        mode='markers',
        name='High',
        marker= dict(size= 11,
            line= dict(width=1),
            color= 'red',
            opacity= 0.9
           )
    ),
]

plot_layout = go.Layout(
        yaxis= {'title': "Revenue"},
        xaxis= {'title': "Orders"},
        title='Customer Segments - Revenue VS Orders',
        autosize=False, width=800, height=700
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
fig.show()
In [112]:
# plotting customer segments for orders vs tenure
plot_data = [
    go.Scatter(
        x=new_df.query("Segment == 'Low Value Customers'")['orders'],
        y=new_df.query("Segment == 'Low Value Customers'")['tenure'],
        mode='markers',
        name='Low',
        marker= dict(size= 7,
            line= dict(width=1),
            color= 'blue',
            opacity= 0.8
           )
    ),
        go.Scatter(
        x=new_df.query("Segment == 'Mid Value Customers'")['orders'],
        y=new_df.query("Segment == 'Mid Value Customers'")['tenure'],
        mode='markers',
        name='Mid',
        marker= dict(size= 9,
            line= dict(width=1),
            color= 'green',
            opacity= 0.5
           )
    ),
        go.Scatter(
        x=new_df.query("Segment == 'High Value Customers'")['orders'],
        y=new_df.query("Segment == 'High Value Customers'")['tenure'],
        mode='markers',
        name='High',
        marker= dict(size= 11,
            line= dict(width=1),
            color= 'red',
            opacity= 0.9
           )
    ),
]

plot_layout = go.Layout(
        yaxis= {'title': "Tenure"},
        xaxis= {'title': "Orders"},
        title='Customer Segments - Tenure VS Orders',
        autosize=False, width=800, height=700
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
fig.show()

Churn Prediction

The objective here is to create a model that classifies the status of a client (Active/Inactive) so that it can be utilized in the future to assess whether an Active customer is likely to become Inactive.

In [113]:
df = pd.read_csv("D:\\EverythingDS\\DataSets\\model_data.csv")
df.head()
Out[113]:
userid country_code orders revenue tenure unique_product client_status
0 101567 PT 1 72.00 1167 1 Inactive
1 105928 US 1 270.00 1215 1 Inactive
2 106684 US 1 288.00 975 1 Inactive
3 107013 ES 2 280.08 2307 2 Inactive
4 107468 IT 1 72.00 1152 1 Inactive
In [114]:
# Mapping the Client Status - Active to 0 and Inactive to 1
status_map = {'Active' : 0 , 'Inactive' : 1}
df['client_status'] = df['client_status'].map(status_map)
In [115]:
# There is quite a high class imbalance in the dataset, with a large majority of Inactive customers
df["client_status"].value_counts()
Out[115]:
1    3825
0     708
Name: client_status, dtype: int64
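Given the roughly 3825:708 (about 5.4:1) imbalance shown above, one common mitigation is class weighting. A sketch of deriving "balanced" weights from these counts, which could later feed a model's class-weight parameter:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# rebuild the label vector from the observed counts above
# (3825 Inactive = 1, 708 Active = 0)
y = np.array([1] * 3825 + [0] * 708)

# balanced weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
```

The minority class (Active) receives the larger weight, so its misclassifications cost more during training.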
In [116]:
# Checking the Data type of the features
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4533 entries, 0 to 4532
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   userid          4533 non-null   int64  
 1   country_code    4533 non-null   object 
 2   orders          4533 non-null   int64  
 3   revenue         4533 non-null   float64
 4   tenure          4533 non-null   int64  
 5   unique_product  4533 non-null   int64  
 6   client_status   4533 non-null   int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 248.0+ KB
In [117]:
# Segregating the features to streamline the preprocessing pipeline
# numerical: discrete vs continuous
discrete = [var for var in df.columns if df[var].dtype!='O' and var!='y' and df[var].nunique()<10]

continuous = [var for var in df.columns if df[var].dtype!='O' and var!='y' and var not in discrete]

# categorical
categorical = [var for var in df.columns if df[var].dtype=='O' and var!='y']

print(f'There are {len(discrete)} discrete variables : ',discrete)
print(f'There are {len(continuous)} continuous variables : ', continuous)
print(f'There are {len(categorical)} categorical variables : ',categorical)
There are 2 discrete variables :  ['unique_product', 'client_status']
There are 4 continuous variables :  ['userid', 'orders', 'revenue', 'tenure']
There are 1 categorical variables :  ['country_code']
In [118]:
# userid is an identifier rather than a predictor; treating it as a continuous feature only adds noise, so it is safer to drop it
df.drop(['userid'], axis = 1, inplace = True)
In [119]:
continuous = [var for var in df.columns if df[var].dtype!='O' and var!='y' and var not in discrete]
print(f'There are {len(continuous)} continuous variables : ', continuous)
There are 3 continuous variables :  ['orders', 'revenue', 'tenure']
In [120]:
# Transforming the continuous variables so their distributions are closer to normal
# log1p (rather than log) is used since the variables contain 0 values
df['revenue'] = df['revenue'].apply(np.log1p)
df['orders'] = df['orders'].apply(np.log1p)
df['tenure'] = df['tenure'].apply(np.log1p)
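The reason for `log1p` over a plain `log` can be seen on a few sample values: revenue contains zeros, and `log(0)` is `-inf`, while `log1p(0) == 0`; `np.expm1` inverts the transform if the raw scale is needed back.

```python
import numpy as np

# sample values spanning the revenue range seen earlier (0 up to 660000)
raw = np.array([0.0, 72.0, 270.0, 660000.0])

transformed = np.log1p(raw)   # log(1 + x): finite even at x = 0
recovered = np.expm1(transformed)  # exact inverse of log1p
```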
In [121]:
# Splitting the dataset into train and test so that we can validate if our model performs well
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('client_status', axis=1), df.client_status, 
                                                    shuffle = True, stratify = df.client_status, 
                                                    test_size=0.2, random_state=0)
In [122]:
prep_pipe = Pipeline([
    # Encoding the country code column
    ('ordinal_enc', OrdinalCategoricalEncoder(variables=categorical, encoding_method='arbitrary')),  
     
    # Scaling
    ("standardscaler" , StandardScaler(with_mean=True, with_std=False)),
    ("robustscaler" , RobustScaler(with_centering=False, with_scaling=True, quantile_range=(0, 100))) 
])
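The two-scaler stack above collapses to a single transform: `StandardScaler` with `with_std=False` only subtracts the mean, and `RobustScaler` with `quantile_range=(0, 100)` and no centering divides by the full min-max range. A toy check of that equivalence (illustrative data, not the notebook's):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler

X = np.array([[1.0], [3.0], [5.0], [11.0]])

# same scaler stack as the notebook's prep_pipe (minus the encoder)
pipe = Pipeline([
    ("center", StandardScaler(with_mean=True, with_std=False)),
    ("range_scale", RobustScaler(with_centering=False, with_scaling=True,
                                 quantile_range=(0, 100))),
])
out = pipe.fit_transform(X)

# manual equivalent: (x - mean) / (max - min)
manual = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```

Note that with the 0th and 100th percentiles the "robust" scaler is no longer robust: a single extreme value stretches the divisor, which matters for heavy-tailed columns like revenue.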
In [123]:
prep_pipe.fit(X_train)
Out[123]:
Pipeline(steps=[('ordinal_enc',
                 OrdinalCategoricalEncoder(encoding_method='arbitrary',
                                           variables=['country_code'])),
                ('standardscaler', StandardScaler(with_std=False)),
                ('robustscaler',
                 RobustScaler(quantile_range=(0, 100), with_centering=False))])
In [124]:
X_train=prep_pipe.transform(X_train)
X_test=prep_pipe.transform(X_test)
In [137]:
cat = CatBoostClassifier()

param_grid = {
            'loss_function' :['Logloss'],
            'depth':[3,1,2,6,4,5,7,8,9,10],
            'iterations':[250,100,500,1000], 
            'learning_rate':[0.03,0.001,0.01,0.1,0.2,0.3], 
            'l2_leaf_reg':[3,1,5,10,100], 
            'border_count':[32,5,10,20,50,100,200],
            # assigning class weights so that we can address the imbalance in the target attribute of the dataset
            'class_weights' : [[100,1], [10,80], [1,20], [1,1], [1,10], [1,100], [1,50], [1, 30]]
                        }

# cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
cat_grid = RandomizedSearchCV(cat, param_grid, cv = 5, n_jobs=-1, scoring = 'f1')
cat_grid.fit(X_train, y_train) 
In [126]:
print(f'Best Params: {cat_grid.best_params_}\n')
print(f'Best Mean Cross Validation Score is {cat_grid.best_score_}\n')
print(f'Test score is {cat_grid.score(X_test,y_test)}')
Best Params: {'loss_function': 'Logloss', 'learning_rate': 0.2, 'l2_leaf_reg': 5, 'iterations': 500, 'depth': 4, 'class_weights': [1, 10], 'border_count': 20}

Best Mean Cross Validation Score is 0.9154933178163484

Test score is 0.9150717703349283
In [127]:
from sklearn.metrics import classification_report
y_pred = cat_grid.predict(X_test)
# note: classification_report expects (y_true, y_pred), in that order
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

         0.0       0.00      0.00      0.00       142
         1.0       0.84      1.00      0.92       765

    accuracy                           0.84       907
   macro avg       0.42      0.50      0.46       907
weighted avg       0.71      0.84      0.77       907

In [128]:
# confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)
Out[128]:
array([[  0, 142],
       [  0, 765]], dtype=int64)
In [129]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tn, fp, fn, tp)
0 142 0 765

We achieved a satisfactory score of about 91% on our chosen metric (F1), but as the confusion matrix shows, the model never predicts the Active class: all 142 Active customers in the test set were misclassified as Inactive. In the future I hope to resolve this by building models that explicitly account for the class imbalance in the dataset (using cost-sensitive, data-sampling, and probability-calibration approaches) to optimize the classification.
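A sketch of the cost-sensitive idea on synthetic imbalanced data (hypothetical data and a stand-in logistic regression model, not the notebook's CatBoost setup): `class_weight='balanced'` penalizes minority-class mistakes more heavily, which typically raises minority recall.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# synthetic 95:5 imbalanced data with partially overlapping classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1, (950, 2)),   # majority class
               rng.normal(2.5, 1, (50, 2))])   # minority class
y = np.array([0] * 950 + [1] * 50)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression().fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
```

The same idea is already available in the grid search above via CatBoost's `class_weights` parameter; resampling (e.g. SMOTE) and probability calibration are complementary options.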

Future Work

1. Conversion rate from trial to paying
  1. I attempted to compute the conversion rate, but found only one converted order in the data.
  2. With better and more complete data, we could determine which products have a high conversion rate, which is very important for product analysis.
In [130]:
# attempt to find conversion rate
df = cancel_table.copy()
grouped = df.groupby(["userid", "product"])
conversion = pd.DataFrame(columns = ["userid", "product", "status"])
for name, group in grouped:
    group.sort_values("customer_since", inplace = True)
    orders = group["orderid"].count()
    if orders > 1:
        if group["Customer Type"].nunique() > 1:
            type_list = list(group["Customer Type"])
            conv_list = []
            for i in range(len(type_list) - 1):
                if type_list[i] == "Paying" and type_list[i + 1] == "Trial":
                    status = "converted"
                    conv_list.append("converted")
                else:
                    conv_list.append("non-converted")
                    
            if conv_list.count("converted") >=1:
                conversion = conversion.append({"userid": group.iloc[0]["userid"],
                                                    "product": group.iloc[0]["product"],
                                                    "status": status}, ignore_index = True)
                
                
conversion
Out[130]:
userid product status
0 92572 Product 9 converted
2. Joining the cancel and package tables to find more relations
  1. I joined the cancel and package tables to look for relationships between them.
  2. Specifically, I checked whether the number of orders in the cancel table matches the number of packages in the package table for each customer.
  3. This analysis could be improved with a more detailed understanding of the products and packages at Sprious.
In [131]:
# function to annualize an order's amount based on its billing cycle
def revenue(df):
    if df["billingcycle"] == "Monthly":
        value = df["amount"] * 12
    elif df["billingcycle"] == "Quarterly":
        value = df["amount"] * 4
    elif df["billingcycle"] == "Semi-Annually":
        value = df["amount"] * 2
    else:
        value = df["amount"]
    
    return value
# organizing dataset for joining
df = cancel_table.copy()
df["revenue"] = df.apply(revenue, axis=1)
# df = pd.get_dummies(df, columns = ["product"])
grouped = df.groupby("userid")
final_data = pd.DataFrame(columns = ["userid", "country_code", "orders", "revenue", "tenure", "client_status", "product"])
for name, group in grouped:
    country_code = group.iloc[0]["country_code"]
    orders = group["orderid"].count()
    revenue = group["revenue"].sum()
    tenure = group["period"].sum()
    product = group["product"].nunique()
    group.sort_values("cancel_date", ascending = False, inplace = True)
    client_status = group.iloc[0]["Client Status"]
    final_data = final_data.append({"userid": name,
                                    "country_code": country_code,
                                    "orders": orders,
                                    "revenue": revenue,
                                    "tenure": tenure,
                                    "client_status": client_status,
                                    "product": product}, ignore_index = True)
    
final_data.head()
Out[131]:
userid country_code orders revenue tenure client_status product
0 101567 PT 1 72.00 1167.0 Inactive 1
1 105928 US 1 270.00 1215.0 Inactive 1
2 106684 US 1 288.00 975.0 Inactive 1
3 107013 ES 2 280.08 2307.0 Inactive 2
4 107468 IT 1 72.00 1152.0 Inactive 1
In [132]:
# organizing package table for joining
df = pd.get_dummies(package_table, columns = ["name", "history_category"])
df.rename(columns = {"history_category_semi-3" : "history_category_semi"}, inplace = True)
df.head()
Out[132]:
CustomerID proxy_category type country country_name name_1000_Ports_Monthly name_100_Ports_Monthly name_10_Ports_Monthly name_25_Ports_Monthly name_50_Ports_Monthly name_5_Ports_Monthly name_Default history_category_block history_category_mobile history_category_rotate history_category_semi history_category_shopify history_category_sneaker history_category_static
0 35200 rotate de-rotate de Germany 0 0 1 0 0 0 0 0 0 1 0 0 0 0
1 47037 static us-static us United States 0 0 1 0 0 0 0 0 0 0 0 0 0 1
2 51755 semi-3 us-semi-3 us United States 0 0 0 0 0 1 0 0 0 0 1 0 0 0
3 55801 static us-static us United States 0 0 0 1 0 0 0 0 0 0 0 0 0 1
4 65531 static us-static us United States 0 0 1 0 0 0 0 0 0 0 0 0 0 1
In [133]:
# reducing the package table
q = """
SELECT CustomerID, SUM(history_category_block),
       SUM(history_category_mobile), SUM(history_category_rotate),
       SUM(history_category_semi), SUM(history_category_shopify),
       SUM(history_category_sneaker), SUM(history_category_static),
       SUM(name_10_Ports_Monthly), SUM(name_100_Ports_Monthly),
       SUM(name_1000_Ports_Monthly), SUM(name_25_Ports_Monthly),
       SUM(name_5_Ports_Monthly), SUM(name_50_Ports_Monthly),
       SUM(name_Default)
FROM df
GROUP BY CustomerID
"""

result = sqldf(q, globals())
result.head()
Out[133]:
CustomerID SUM(history_category_block) SUM(history_category_mobile) SUM(history_category_rotate) SUM(history_category_semi) SUM(history_category_shopify) SUM(history_category_sneaker) SUM(history_category_static) SUM(name_10_Ports_Monthly) SUM(name_100_Ports_Monthly) SUM(name_1000_Ports_Monthly) SUM(name_25_Ports_Monthly) SUM(name_5_Ports_Monthly) SUM(name_50_Ports_Monthly) SUM(name_Default)
0 1320 0 0 0 1 0 0 1 0 0 0 0 0 0 2
1 1432 0 0 1 0 0 0 1 0 0 0 0 0 0 2
2 2125 0 0 1 0 0 0 0 0 0 0 0 0 0 1
3 3203 0 0 0 1 0 0 1 0 0 0 0 0 0 2
4 4099 0 0 2 1 0 0 3 0 0 0 0 0 0 6
In [134]:
# calculating total packages
result["packages"] = result["SUM(name_10_Ports_Monthly)"] + result["SUM(name_100_Ports_Monthly)"] \
                    + result["SUM(name_1000_Ports_Monthly)"] + result["SUM(name_25_Ports_Monthly)"] \
                    + result["SUM(name_5_Ports_Monthly)"] + result["SUM(name_50_Ports_Monthly)"] \
                    + result["SUM(name_Default)"]

result.head()  
Out[134]:
CustomerID SUM(history_category_block) SUM(history_category_mobile) SUM(history_category_rotate) SUM(history_category_semi) SUM(history_category_shopify) SUM(history_category_sneaker) SUM(history_category_static) SUM(name_10_Ports_Monthly) SUM(name_100_Ports_Monthly) SUM(name_1000_Ports_Monthly) SUM(name_25_Ports_Monthly) SUM(name_5_Ports_Monthly) SUM(name_50_Ports_Monthly) SUM(name_Default) packages
0 1320 0 0 0 1 0 0 1 0 0 0 0 0 0 2 2
1 1432 0 0 1 0 0 0 1 0 0 0 0 0 0 2 2
2 2125 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1
3 3203 0 0 0 1 0 0 1 0 0 0 0 0 0 2 2
4 4099 0 0 2 1 0 0 3 0 0 0 0 0 0 6 6
In [135]:
# joining two tables
q = """
SELECT fd.userid, fd.orders, fd.product, rd.packages
FROM final_data as fd
INNER JOIN result as rd
ON fd.userid = rd.CustomerID
"""

joined = sqldf(q, globals())
joined.head()
Out[135]:
userid orders product packages
0 101567 1 1 1
1 105928 1 1 1
2 106684 1 1 1
3 107013 2 2 2
4 107468 1 1 1
In [136]:
# boolean list: does each customer's order count match their package count?
match = list(joined["product"] == joined["packages"])
print(len(match))
print(match.count(True))
4098
3300
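Summarizing the comparison above as a rate, from the two totals printed: 3300 of the 4098 joined customers have an order count equal to their package count.

```python
# match rate computed from the totals printed above
matches, total = 3300, 4098
match_rate = matches / total  # roughly 0.805, i.e. about 80.5% agreement
```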